We describe some of the key aspects of the SAMGrid system, used by the D0 and CDF experiments at
Fermilab. Building on the sustained success of the data handling part of SAMGrid, we have developed
new services for job and information management. Our job management is rooted in Condor-G and uses
enhancements that are of general applicability for HEP grids. Our information system is based on a
uniform framework for configuration management built on XML data representation and processing.



1. Introduction 

The Grid has emerged as a modern trend in computing,
aiming to support the sharing and coordinated use of
diverse resources by Virtual Organizations (VOs) in
order to solve their common problems [1]. It was
originally driven by scientific, and especially
High-Energy Physics (HEP), communities. HEP experiments
are a classic example of large, globally distributed
VOs whose participants are scattered over many
institutions and collaborate on studies of experimental
data, primarily on data processing and analysis.

Our background is specifically in the development of
large-scale, globally distributed systems for HEP
experiments. We apply grid technologies to our systems
and develop higher-level, community-specific grid
services (generally defined in [3]), currently for the
two collider experiments at Fermilab, D0 and CDF. These
two experiments are the largest currently running HEP
experiments, each having over half a thousand users and
planning to repeatedly analyze petabyte-scale data.

The success of distributed computing for the
experiments depends on many factors. In HEP computing,
which remains a principal application domain for the
Grid as a whole, jobs are data-intensive and therefore
data handling is one of the most important factors. For
HEP experiments such as D0 and CDF, data handling is
the center of the meta-computing grid system [4]. The
SAM data handling system was originally developed for
the D0 collaboration and is currently also used by CDF. The
system is described in detail elsewhere (see, for
example, [5] and references therein). Here, we only
note some of the advanced features of the system: the
ability to coordinate multiple concurrent accesses to
Storage Systems [8] and global data routing and
replication [9].

Given the ability to distribute data on demand
globally, we face similar challenges in distributing
the processing of the data. Generally, for this purpose
we need global job scheduling and information
management, a term we prefer over "monitoring" as we
strive to include configuration management, resource
description, and logging.

In recent years, we have been working on the SAMGrid
project [10], which addresses the grid needs of the
experiments; our current focus is on Jobs and
Information Management (JIM), which complements the
SAM grid data handling system with services for job
submission, brokering and execution as well as
distributed monitoring. Together, SAM and JIM form
SAMGrid, a "VO-specific" grid system.

In this paper, we present some key ideas from our 
system's design. For job management per se, we 
collaborate with the Condor team to enhance the 
Condor-G middleware so as to enable scheduling of 
data-intensive jobs with flexible resource description. 
For information, we focus on describing the sites'
resources in the tree-like structures of XML, with
subsequent projections onto the Condor Classified
Advertisements (ClassAd) framework and monitoring with
Globus MDS and other tools.

The rest of the paper is organized as follows. We
discuss the relevant job scheduling design issues and
Condor enhancements in Section 2. In Section 3, we
describe configuration management and monitoring.
In Section 4, we present the status of the project and
its interfaces with the experiments' computing
environment and fabric; we conclude in Section 5.



2. Job Scheduling and Brokering 

A key area in Grid computing is job management,
which typically includes planning of a job's
dependencies, selection of the execution cluster(s) for
the job, scheduling of the job at the cluster(s), and
ensuring reliable submission and execution. We base our
solution on the Condor-G framework [11], a powerful
Grid middleware commonly used for distributed
computing. Thanks to the Particle Physics Data Grid
(PPDG) collaboration [12], we have been able to work
with the Condor team to enhance the Condor-G framework
and then implement higher-level functions on top of it.
In this Section, we first summarize the general
Condor-G enhancements and then proceed to describe how
we schedule data-intensive jobs for D0 and CDF.

2.1. The Grid Enhancements to Condor

We have designed three principal enhancements for 
Condor-G. These have all been successfully imple- 
mented by the Condor team: 

• Original Condor-G required users to either spec- 
ify which grid site would run a job, or to use 
Condor-G's Glidein technology. We have en- 
abled Condor-G to use a matchmaking service 
to automatically select sites for users. 

• We have extended the ClassAd language, used 
by the matchmaking framework to describe re- 
sources, to include externally supplied functions 
to be evaluated at match time. This allows the 
matchmaker to base its decision not only on ex- 
plicitly advertised properties but also on opaque 
logic that is not statically expressible in a Clas- 
sAd. Other uses include incorporation of infor- 
mation that is prohibitively expensive to publish 
in a ClassAd, such as local storage contents or 
lists of site-authorized Grid users. 

• We removed the restriction that the job submis- 
sion client had to be on the same machine as 
the queuing system and enabled the client to 
securely communicate with the queue across a 
network, thus creating a multi-tiered job sub- 
mission architecture. 

Fundamentally, these changes are sufficient to form
a multi-user, multi-site job scheduling system for
generic jobs. Thus, a novelty of our design is that we
use standard Grid technologies to create a highly
reusable framework for job scheduling, as opposed to
writing our own Resource Broker, which would be
specific to our experiments.



In the remainder of this Section, we describe
higher-level features for job management that are
particularly important for data-intensive applications.

2.2. Combination of the Advertised and 
Queried Information in the MMS 

The classic matchmaking service (MMS) gathers
information about the resources in the form of
published ClassAds. This allows for a general and
flexible framework for resource management (e.g.,
matching of jobs and resources), see [13]. There is one
important limitation in that scheme, however: the
entities (jobs and resources) have to be able to
express all their relevant properties upfront and
irrespective of the other party.

Recall that our primary goal was to enable co- 
scheduling of jobs and data. In data-intensive com- 
puting, jobs are associated with long lists of data items 
(such as files) to be processed by the job. Similarly, 
resources are associated with long lists of data items 
located, in the network sense, near them. For exam- 
ple, jobs requesting thousands of files and sites having 
hundreds of thousands of files are not uncommon in 
production in the SAM system. Therefore, it would 
not be scalable to explicitly publish all the properties 
of jobs and resources in the ClassAds. 

Furthermore, in order to rank jobs at a resource (or
resources for the job), we wish to include additional
information that could not be expressed in the ClassAds
at the time of publishing, i.e., before the match.
Rather, we can analyze such information during the
match, in the context of the job request. For example,
a site may prefer a job based on similar, already
scheduled data handling requests.^1 Another example of
useful additional information, not specific to
data-intensive computing, is the pre-authorization of
the job's owner with the participating site, e.g., by
looking up the user's grid subject in the site's
gridmapfile. Such a pre-authorization is not a
replacement for security, but rather a means of
protecting the matchmaker from some blunders that
otherwise tend to occur in practice.

The original MMS scheme allowed such additional
information to be incorporated only in the claiming
phase, i.e., after the match, when the job's scheduler
actually contacts the machine. In the SAMGrid design,
we augment information processing by the MMS with the
ability to query the resources with a job in the
context. This is pictured in Figure 1 by arrows
extending from the resource selector to the resources,
specifically to the local data handling agents, in the
course of matching.

^1 Unlike the information about data already placed at
sites, the information about scheduled data requests,
and their estimated time of completion, is not
described by any popular concept like a replica catalog.

Figure 1: The job management architecture in SAMGrid. 1,2 - Jobs are submitted while resources are advertised; 3 -
MMS matches jobs with resources; 4,5 - ranking functions retrieve additional information from the data handling
system; 6,7 - resource is selected and the job is scheduled.

This querying is implemented by means of an externally
supplied ranking function whose evaluation involves
remote call invocation at the resources' sites.
Specifically, the resource ClassAds in our design
contain pointers to additional information providers
(data handling servers called Stations):

Station_ID = foo

and the body of the MMS ranking function called from
the job ClassAd

Rank = fun(job_dataset, OTHER.Station_ID)

includes logic similar to this pseudo-code:

station = resolve(Station_ID, ...)
return station->get_preference(job_dataset, ...)

In the next subsection, we discuss how we believe this 
will improve the co-scheduling of jobs with the data. 
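
The match-time ranking described above can be sketched in Python. This is only an illustration under stated assumptions: the `Station` class, the in-memory `STATIONS` registry, and the "fraction of files already cached" preference are hypothetical stand-ins, and a real deployment would replace `get_preference` with a remote call to the site's data handling server.

```python
from dataclasses import dataclass, field


@dataclass
class Station:
    """Hypothetical stand-in for a SAM station reached by a remote call."""
    station_id: str
    cached_files: set = field(default_factory=set)

    def get_preference(self, job_dataset):
        # Assumed metric: fraction of the job's files already at the site.
        if not job_dataset:
            return 0.0
        return len(self.cached_files & set(job_dataset)) / len(job_dataset)


# Hypothetical registry used to resolve a Station_ID found in a resource ad.
STATIONS = {}


def resolve(station_id):
    return STATIONS[station_id]


def rank(job_dataset, resource_ad):
    """Evaluated at match time, with the job in context."""
    station = resolve(resource_ad["Station_ID"])
    return station.get_preference(job_dataset)


def select_resource(job_dataset, resource_ads):
    # The matchmaker keeps the resource with the highest rank.
    return max(resource_ads, key=lambda ad: rank(job_dataset, ad))


STATIONS["foo"] = Station("foo", {"f1", "f2", "f3"})
STATIONS["bar"] = Station("bar", {"f3"})
ads = [{"Station_ID": "foo"}, {"Station_ID": "bar"}]
best = select_resource(["f1", "f2", "f3", "f4"], ads)
print(best["Station_ID"])
```

The point of the sketch is that the resource ad carries only a pointer (`Station_ID`), while the long file lists stay at the station and are consulted only when a concrete job is being matched.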

2.3. Interfacing with the SAM Data 
Handling System 

The co-scheduling of jobs and data has always been
critical for the SAM system, where at least a subset
of HEP analysis jobs (as of the time of writing, the
dominating class) have their latencies dominated by
data access. Note that the SAM system already
implemented the advanced feature of retrieving
multi-file datasets asynchronously with respect to the
user jobs [11]; this was done initially at the cluster
level rather than at the grid level.

Generally, with data-intensive jobs, we attempt to
minimize the time to retrieve any missing data and the
time to store output data, as these times propagate
into the job's overall latency. As we try to minimize
the grid job latency, we ensure that the design of our
system (Figure 1) is such that the data handling
latencies will be taken into account in the process of
job matching. This is a principal point of the present
paper: while we do not yet possess sufficient real
statistics that would justify certain design decisions,
we stress that our system design enables the various
strategies and supports the considerations listed below.

In the minimally intelligent implementation, we
prefer sites that contain most of the job's data.
Our design does not rely on a replica catalogue
because, in the general case, we need local metrics
computed by and available from the data handling system:

• The network speeds for connections to the 
sources of any missing data; 

• The depths of the queues of data requests for 
both input and output; 

• The network speeds for connections to the nearest
destination of the output files.^2



^2 In the SAM system, the concept of data routing is
implemented such that the first transfer of an output
file is seldom done directly to the final destination.
It is important that network speeds be provided by a
high-level service in the data handling system rather
than by a low-level network sensor, for the same reason
that having, e.g., a 56 Kbps connection to the ISP does
not necessarily enable one to actually download files
from the Internet at that speed.
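
As a rough illustration of how the three metrics above could be combined, the following Python sketch estimates a per-site data handling latency and prefers the site with the lowest value. The additive model, the field names, and the sample numbers are all assumptions made for this example; the paper does not prescribe a particular formula.

```python
from dataclasses import dataclass


@dataclass
class SiteMetrics:
    """Metrics assumed to be reported by a site's data handling service."""
    missing_input_bytes: float  # job data not yet at the site
    output_bytes: float         # expected size of the job's output
    input_rate: float           # effective bytes/s from sources of missing data
    output_rate: float          # effective bytes/s to the nearest output destination
    input_queue_wait: float     # seconds implied by the input request queue depth
    output_queue_wait: float    # seconds implied by the output request queue depth


def data_handling_latency(m: SiteMetrics) -> float:
    """Estimated seconds spent fetching missing input and storing output."""
    fetch = m.input_queue_wait + m.missing_input_bytes / m.input_rate
    store = m.output_queue_wait + m.output_bytes / m.output_rate
    return fetch + store


def preferred_site(sites: dict) -> str:
    # Lower estimated latency is better.
    return min(sites, key=lambda name: data_handling_latency(sites[name]))


sites = {
    # All input already present, but a busy output queue.
    "site_a": SiteMetrics(0.0, 1e9, 1e7, 1e7, 0.0, 60.0),
    # Faster input link, but 50 GB of input still has to be fetched.
    "site_b": SiteMetrics(5e10, 1e9, 5e7, 1e7, 30.0, 0.0),
}
print(preferred_site(sites))
```

Here the rates are meant to be the effective, service-level throughputs discussed above, not nominal link capacities.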

3. The Management of Configuration and 
Information in JIM 

Naturally, being able to submit jobs and schedule 
them more or less efficiently is necessary but not suf- 
ficient for a Grid system. One has to understand how 
resources can be described for such decision making, 
as well as provide a framework for monitoring of par- 
ticipating clusters and user jobs. 

We consider these and other aspects of information 
management to be closely related to issues of Grid 
configuration. In the JIM project, we have developed 
a uniform configuration management framework that 
allows for generic grid services instantiation, which in 
turn gives flexibility in the design of the Grid as well 
as inter-operability on the Grid (see below). 

In this Section, we briefly introduce the main config- 
uration framework and then project it onto the various 
SAMGrid services having to do with information. 

3.1. The Core Configuration Mechanism 

Our main proposal is that grid sites be configured 
using a site-oriented schema, which describes both re- 
sources and services, and that grid instantiation at the 
sites be derived from these site configurations. We are 
not proposing any particular site schema at this time, 
although we hope for the Grid community as a whole 
to arrive at a common schema in the future which will 
allow reasonable variations such that various grids are 
still instantiatable. 

Figure 2 shows configuration derivation in the
course of instantiation of a grid at a site. The site
configuration is created using a meta-configurator
similar to the one we propose below.

3.1.1. The Core Meta-Configurator and the Family of
Configurators

In our framework, we create the site and all other
configurations with a universal tool which we call a
meta-configurator, or configurator of configurators.
The idea is to separate the process of querying the
user for values of attributes from the schema that
describes what those attributes are, how they should be
queried, how to guess the default values, and how to
derive values of attributes from those of other
attributes. Any concrete configurator uses a concrete
schema to ask the relevant questions of the end user
(site administrator) in order to produce that site's
configuration. Any particular schema is in turn derived
from a meta-schema. Thus, the end configuration can be
represented as:

$$ C = c(S_d, I_u) = c(c(S_0, I_d), I_u), $$

where $C$ is a particular configuration, $c$ is the
configuration operation, $S_d$ is a particular schema
reflecting a certain design, $S_0$ is the meta-schema,
and $I_d$ and $I_u$ are the inputs of the designer and
the user, respectively.

In our framework, configurations and schemas are 
structures of the same type, which we choose to be 
trees of nodes each containing a set of distinct at- 
tributes. Our choice has been influenced by the suc- 
cesses of the XML technologies and, naturally, we use 
XML for representing these objects. 

To exemplify, assume that in our present design, a
grid site consists of one or more clusters, each having
a name and a (homogeneous) architecture, as well as
exactly one gatekeeper for Grid access. An example
configuration is:

<?xml version='1.0'?>
<site name='FNAL'
      schema_version='v0_3'>
  <cluster name='samadams'
           architecture='Linux'>
    <gatekeeper ... />
  </cluster>
</site>

This configuration was produced by the following 
schema 

<?xml version='1.0'?>
<site cardinalityMin='1'
      cardinalityMax='1'
      name='inquire-default,FNAL'>
  <cluster cardinalityMin='1'
           name='set,CLUSTERNAME,inquire'
           architecture='inquire-default,exec,uname' />
</site>

in an interactive session with the site administrator as 
follows: 

What is the name of the site ? [FNAL] : <return>
What is the name of cluster at the site 'FNAL'? samadams
What is the architecture of cluster 'samadams' [Linux]?

When the schema changes or a new cluster is cre- 
ated at the site, the administrator merely needs to 
re-run the tool and answer the simple questions again. 
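
The separation between the schema and the querying process can be sketched as follows. This is a minimal illustration, not the JIM tool itself: the interpretation of the directives (`inquire-default` offers a default, `exec` obtains that default from a command, `set` stores the answer under a variable name) is inferred from the example above, the question wording is simplified, cardinality handling is omitted, and the `ask` and `execute` callbacks are hypothetical hooks introduced so that the session can be scripted.

```python
import subprocess
import xml.etree.ElementTree as ET

SCHEMA = """
<site cardinalityMin='1' cardinalityMax='1' name='inquire-default,FNAL'>
  <cluster cardinalityMin='1'
           name='set,CLUSTERNAME,inquire'
           architecture='inquire-default,exec,uname' />
</site>
"""

# Attributes that steer the configurator and are not copied to the output.
CONTROL = {"cardinalityMin", "cardinalityMax"}


def run(command):
    """Default way to obtain a value from the system, e.g. 'uname'."""
    return subprocess.run([command], capture_output=True, text=True).stdout.strip()


def resolve(directive, question, ask, variables, execute=run):
    """Turn one schema directive into a concrete attribute value."""
    parts = directive.split(",")
    if parts[0] == "inquire-default":
        # The default is a literal, or the output of a command after 'exec'.
        default = execute(parts[2]) if parts[1] == "exec" else parts[1]
        return ask(f"{question} [{default}]: ") or default
    if parts[0] == "set":
        # Remember the answer under a variable name for later reuse.
        variables[parts[1]] = ask(f"{question}: ")
        return variables[parts[1]]
    return ask(f"{question}: ")


def configure(schema, ask, execute=run, variables=None):
    """Build a configuration element from a schema element (one instance each)."""
    variables = {} if variables is None else variables
    node = ET.Element(schema.tag)
    for attr, directive in schema.attrib.items():
        if attr not in CONTROL:
            question = f"What is the {attr} of the {schema.tag}?"
            node.set(attr, resolve(directive, question, ask, variables, execute))
    for child in schema:
        node.append(configure(child, ask, execute, variables))
    return node


# Scripted session: <return>, "samadams", <return>; 'uname' is faked as Linux.
answers = iter(["", "samadams", ""])
site = configure(ET.fromstring(SCHEMA), lambda prompt: next(answers),
                 execute=lambda command: "Linux")
print(ET.tostring(site, encoding="unicode"))
```

For interactive use, one would pass the built-in `input` as `ask` and keep the default `execute`. Because both the schema and the result are trees of the same type, the same `configure` function could in principle be applied to a meta-schema to produce a schema.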
Figure 2: Configuration creation and derivation in our framework. Service collections are typical of the SAMGrid
project.



3.2. Resource Advertisement for 
Condor-G 

In the JIM project, we have designed the grid job
management as follows. We advertise the participating
grid clusters to an information collector, and grid
jobs are matched with clusters (resources) based on
certain criteria, primarily having to do with data
availability at the sites. We have implemented this job
management using Condor-G [11] with extensions that we
have designed together with the Condor
team [Tcij . 

For the job management to work as described in
Section 2, we need to advertise the clusters together
with the gatekeepers, as the means for Condor to
actually schedule and execute the grid job at the
remote site. Thus, our present design requires that
each advertisement contain a cluster, a gatekeeper, a
SAM station (for jobs actually intending to process
data), and a few other attributes that we omit here.
Our advertisement software then selects from the
configuration tree all patterns containing these
attributes and applies a ClassAd generation algorithm
to each pattern.

The selection of the subtrees that are ClassAd
candidates is based on the XQuery language. Our queries
are generic enough to allow for design evolution, i.e.,
to be resilient to some modifications in the schema.
When new attributes are added to an element in the
schema, or when the very structure of the tree changes
due to the insertion of a new element, our
advertisement service will continue to advertise these
clusters with or without the new information (depending
on how the advertiser itself is configured); the
important factor is that this site will continue to be
available to our grid.

For example, assume that one cluster at the site
from subsection 3.1.1 now has a new grid gatekeeper
mechanism from Globus Toolkit 3, in addition to the
old one:

<?xml version='1.0'?>
<site name='FNAL'
      schema_version='v0_3'>
  <cluster name='samadams'
           architecture='Linux'>
    <grid_accesses>
      <gatekeeper ... />
      <gatekeeper-gtk3 ... />
    </grid_accesses>
  </cluster>
</site>



Assume further that our particular grid is not yet ca- 
pable of taking advantage of the new middleware and 
we continue to be interested in the old gatekeeper 
from each cluster. Our pattern was such that a gate- 
keeper is a descendant of the cluster so we continue 
to generate meaningful ClassAds and match jobs with 
this site's cluster(s). 
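
The resilience of a descendant-based pattern can be illustrated with a short Python sketch. Note the substitutions: the limited XPath support of Python's `xml.etree.ElementTree` stands in for XQuery, and the `url` attribute of the gatekeeper as well as the generated attribute names (`Site`, `Cluster`, `Architecture`, `Gatekeeper`) are hypothetical, since the actual gatekeeper attributes are elided in the examples above.

```python
import xml.etree.ElementTree as ET

OLD = """
<site name='FNAL' schema_version='v0_3'>
  <cluster name='samadams' architecture='Linux'>
    <gatekeeper url='gk.example.org'/>
  </cluster>
</site>
"""

NEW = """
<site name='FNAL' schema_version='v0_3'>
  <cluster name='samadams' architecture='Linux'>
    <grid_accesses>
      <gatekeeper url='gk.example.org'/>
      <gatekeeper-gtk3 url='gk3.example.org'/>
    </grid_accesses>
  </cluster>
</site>
"""


def classads(config_xml):
    """Yield one ClassAd-like text per (cluster, gatekeeper) pattern."""
    site = ET.fromstring(config_xml)
    for cluster in site.iter("cluster"):
        # './/gatekeeper' matches a gatekeeper at any depth below the cluster,
        # so an intermediate <grid_accesses> element does not break the pattern.
        for gatekeeper in cluster.findall(".//gatekeeper"):
            yield "\n".join([
                f'Site = "{site.get("name")}"',
                f'Cluster = "{cluster.get("name")}"',
                f'Architecture = "{cluster.get("architecture")}"',
                f'Gatekeeper = "{gatekeeper.get("url")}"',
            ])


# Both layouts produce the same advertisement for the old gatekeeper.
assert list(classads(OLD)) == list(classads(NEW))
print(next(classads(NEW)))
```

The new `gatekeeper-gtk3` element is simply not selected by this pattern, so a grid that cannot yet use it keeps advertising the site unchanged.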



3.3. Monitoring Using Globus MDS 

In addition to advertising (pushing) resource
information for the purpose of job matching, we deploy
Globus MDS-2 for pull-based retrieval of information
about the clusters and the activities (jobs and more,
such as data access requests) associated with them. This
allows us to enable web-based monitoring, primarily by
humans, for performance and troubleshooting [10]. We
introduce (or redefine in the context of our project)
concepts such as cluster and station, and map them onto
LDAP attributes in the OID space assigned to our
project (to the FNAL organization, to be exact) by the
IANA. We also create additional branches in the MDS
information tree so as to represent our concepts and
their relations.

We derive the values of the dn's on the information 
tree from the site configuration. In this framework, 
it is truly straightforward to use XSLT (or a straight 
XML-parsing library) to select the names and other 
attributes of the relevant pieces of configuration. For 
example, if the site has two clusters defined in the 
configuration file, our software will automatically in- 
stantiate two branches for the information tree. Note 
that the resulting tree may of course be distributed 
within the site as we decide e.g. to run an MDS server 
at each cluster, which is a separate degree of freedom. 
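
Deriving the dn's from the site configuration can be sketched with a plain XML-parsing library, as mentioned above. In this Python illustration the attribute types `cluster=` and `site=`, the `o=grid` suffix, and the second cluster name are hypothetical placeholders; the actual LDAP attributes registered for the project are not given in this paper.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<site name='FNAL'>
  <cluster name='samadams'/>
  <cluster name='cluster2'/>
</site>
"""


def cluster_dns(config_xml, suffix="o=grid"):
    """Return one distinguished name per cluster in the site configuration."""
    site = ET.fromstring(config_xml)
    return [
        f"cluster={c.get('name')},site={site.get('name')},{suffix}"
        for c in site.iter("cluster")
    ]


# Two clusters in the configuration yield two branches of the tree.
for dn in cluster_dns(CONFIG):
    print(dn)
```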



3.4. Multiple Grid Instantiation and 
Inter-Operability 

As we have mentioned, there are in fact several other
grid projects developing high-level Grid solutions;
some of the most noteworthy include the European
DataGrid, the CrossGrid, and the NorduGrid.
Inter-operability of grids (or of solutions on the
Grid, if you prefer) is a well-recognized issue in the
community. The High Energy and Nuclear Physics
InterGrid [23] and Grid Inter-Operability [23] projects
are some of the most prominent efforts in this area. As
we have pointed out in the Introduction, we believe
that inter-operability must include the ability to
instantiate and maintain multiple grid service suites
at sites.

A good example of inter-operability in this sense
is given by various cooperating Web browsers, which
all understand the user's bookmarks, mail preferences,
etc. Of course, each browser may give a different look
and feel to its "bookmarks" menu, and otherwise treat
the bookmarks in entirely different ways, yet most
browsers tend to save them in the common HTML format,
which has de facto become the standard for bookmarks.
Our framework, proposed and described in this Section,
is a concrete means to facilitate this aspect of
inter-operability. Multiple grid solutions can be
instantiated using a grid-neutral, site-oriented
configuration in an XML-based format.

We can go one step further and envisage that the 
various grids instantiated at a site have additional, 
separate configuration spaces that can easily be con- 
glomerated into a grid instantiation database. In prac- 
tice, this will allow the administrators e.g., to list all 
the Globus gatekeepers with one simple query. 



4. Integration and Project Status 



To provide a complete computing solution for the 
experiments, one must integrate grid-level services 
with those on the fabric. Ideally, grid-level schedul- 
ing complements, rather than interferes with, that of 
local batch systems. Likewise, grid-level monitoring 
should provide services that are additional (orthogo- 
nal) to those developed at the fabric's facilities (i.e., 
monitoring of clusters' batch systems, storage systems 
etc.). 

Our experiments have customized local environments.
CDF has been successfully using the Cluster Analysis
Facility (CAF). D0 has been using MCRunJob [23], a
workflow manager which is also part of the CMS
computing infrastructure. An important part of the
SAMGrid project is to integrate its job and information
services with these environments.

For job management, we have implemented GRAM- 
compliant job managers which pass control from Grid- 
GRAM to each of these two systems (which in turn 
are on top of the various batch systems). Likewise, for 
the purposes of (job) monitoring, these systems supply 
information about their jobs to the XML databases 
which we deploy on the boundary between the Grid 
and the Fabric. (For resource monitoring, these ad- 
vertise their various properties using the frameworks 
described above). 

We delivered a complete, integrated prototype of
SAMGrid in the Fall of 2002. Our initial testbed
linked 11 sites (5 D0 and 6 CDF) and provided the basic
services of grid job submission, brokering and
monitoring. Our near-future plans include further work
on the Grid-Fabric interface and more features for
troubleshooting and error recovery.



5. Summary 



We have presented the two key components of the 
SAMGrid, a SAM-based datagrid being used by the 
Run II experiments at FNAL. To the data handling 
capabilities of SAM, we add grid job scheduling and 
brokering, as well as information processing and mon- 
itoring. We use the standard Condor-G middleware 
so as to maximize the reusability of our design. As 
to the information management, we have developed 
a unified framework for configuration management in 
XML, from where we explore resource advertisement, 
monitoring and other directions such as service in- 
stantiation. We are deploying SAMGrid at the time 
of writing this paper and learning from the new expe- 
riences. 